ABSTRACT


In the realm of computer vision, the detection of vehicles in aerial photography holds significant 
importance for various applications. Traditional methods rely on computationally intensive techniques with 
limited effectiveness in handling small objects like vehicles in large-scale aerial images. Recent 
advancements in deep learning, particularly R-CNNs, have shown promise but are hindered by challenges 
such as small object detection and the high cost of human annotation for training data. In response, this 
research proposes a novel system for efficient and accurate vehicle detection. Our approach utilizes a 
combination of deep learning techniques, including an encoder-decoder architecture for image 
segmentation and a hyper feature map for precise vehicle proposal generation. Additionally, we introduce 
the VCLDA model for vehicle classification, fine-tuned using the ARFOA algorithm. Experimental results 
demonstrate significant performance improvements, achieving detection rates of 84% on the Vehicle Aerial 
Imagery dataset, 73% on the Vehicle Detection in Aerial Imagery (VEDAI) dataset, and 64% on the German 
Aerospace Centre (DLR) DLR3K dataset. The proposed system has diverse potential applications, 
including traffic monitoring, congestion detection, intersection analysis, vehicle categorization, and 
pedestrian safety measures. 


Keywords: Accurate Vehicle Proposal Network; Artificial Root Foraging Optimizer Algorithm; Region-Based Convolutional Neural Networks; Vehicle Detection; Vehicle Classification Based Linear Discriminant Analysis.

1. INTRODUCTION

Disaster relief, security, traffic flow monitoring, military target reconnaissance, and vehicle remote sensing attract wide interest because of the important role vehicles play in these detection tasks [1]. The majority of current vehicle detection research is centered on visible-light images. Many detection tasks, such as nighttime reconnaissance of military vehicles or monitoring traffic flow in foggy weather, must be performed in low light or severe weather [2-3]. Under low light and poor visibility, however, visible-light sensors do not function properly. Research on vehicle recognition in aerial photos is motivated by the fact that infrared sensors can operate continuously even in bad weather [4].

Vehicle detection in aerial photographs has recently attracted considerable attention because of its importance for many different applications [5]. The small size of cars (as small as 30 x 12 pixels), their many varieties, and their changeable orientation make vehicle recognition a tough challenge [6]. False positives can also occur when there are many structures (such as road markings or air conditioning units on buildings) that resemble automobiles [7]. The restricted processing time available to real-time applications makes vehicle recognition more challenging still. Prior research has suggested several methods for vehicle recognition in aerial photos [8]. The standard practice is a sliding-window search, which scans each image at every position and at varying sizes. Support vector machine (SVM) classifiers, AdaBoost classifiers, or features based on shallow learning are employed to check whether a vehicle is present in each window [9, 10]. Some approaches rely on road databases as prior knowledge to identify cars on roads, which makes them unsuitable for general situations [11].

Deep Convolutional Neural Networks [12], Fast Region-based CNN (Fast R-CNN) [13], the Single Shot Multibox Detector (SSD) [14], and CNN-based recognition of remote sensing images have all been explored. All of these approaches rely on regular rectangles to frame and identify targets [14]. However, in many scenarios semantic segmentation would be ideal: it correctly discerns the shape of each structure, road, river, car, and so on, using the target's shape itself as the cue for locating and segmenting it [15]. Semantic segmentation has recently received a lot of attention from AI researchers. By interpreting a picture based on its pixels' locations and values, it transforms raw data (such as a flat image) into a mask that highlights each region [16].

This paper focuses on vehicle detection for traffic monitoring systems that use aerial video from drones and closed-circuit television cameras. To support effective traffic control, our study proposes a new approach that combines image segmentation, vehicle detection, and classification. Semantic segmentation is first performed on the aerial photos. Next, an Accurate Vehicle Proposal Network (AVPN) finds cars in the segmented picture. The identified cars are then sorted into seven groups using an LDA-based classifier (VCLDA) fine-tuned with ARFOA. Experiments on the VAID, VEDAI, and German Aerospace Centre (DLR) DLR3K datasets verify the proposed model. Compared with other state-of-the-art (SOTA) methods, ours achieved considerably higher detection and classification accuracy in these experiments.
1.1 Problem Statement 

Detection and classification of vehicles in 
aerial photography present significant challenges 
due to the limitations of traditional methods in 
handling small objects like vehicles in large-scale 
images. Moreover, existing deep learning 
approaches face obstacles such as small object 
detection and the high cost of human annotation for 
training data. These limitations hinder the 
development of efficient and accurate vehicle 
detection systems for applications such as traffic 
monitoring, congestion detection, and intersection 
analysis. 
1.2 Research Objectives 

• To develop a novel system for efficient and accurate vehicle detection in aerial imagery by leveraging deep learning techniques.

• To address the challenges of small object detection and high annotation costs by proposing a combination of encoder-decoder architecture for image segmentation and a hyper feature map for precise vehicle proposal generation.

• To introduce the VCLDA model for vehicle classification, fine-tuned using the ARFOA algorithm, to improve detection accuracy.

• To evaluate the performance of the proposed system on benchmark datasets, including the Vehicle Aerial Imagery dataset, the Vehicle Detection in Aerial Imagery (VEDAI) dataset, and the German Aerospace Centre (DLR) DLR3K dataset.


The rest of the paper is arranged as follows: Section 2 reviews the relevant literature, Section 3 details the proposed method, Section 4 analyzes the results, and Section 5 draws conclusions.


2. RELATED WORK 

Orić et al. [17] developed a process for creating synthetic datasets using Blender software and aerial photography. The pipeline consists of seven phases and yields the desired number of photos with bounding boxes in the COCO and YOLO formats. Following these steps, the authors created a synthetic dataset of five thousand 2048 x 2048 photos from various locations worldwide, with automobiles added onto the roads and highways. They argue that this dataset, along with the associated pipeline, may be valuable for vehicle identification because it facilitates tailoring models to specific situations and requirements.

To address the lack of such a dataset, Mustafa & Alizadeh [18] presented a dataset of 2,160 photographs of automobiles on roads in the Kurdistan Region. An Air 2 drone captured the photos in the Iraqi cities of Erbil and Sulaymaniyah. The images are classified into five categories: personal automobile, truck, bus, taxi, and motorbike. Data collection considered various factors, including different vehicle sizes, weather conditions, illumination, and large camera motions. The photos underwent pre-processing and data augmentation, such as auto-orientation and brightness adjustment, so that they can be used to build effective deep learning (DL) models. After augmentation, the number of photos increased to 5,353 for vehicles, 1,500 for taxis, 1,192 for trucks, and 282 and 176 for the other classes.

Wang et al. [19] proposed the cross-modal aerial remote sensing image object detection (CRSIOD) network, which efficiently learns target characteristics under varied conditions. First, an illumination perception module guides the object detection network as it performs several feature processing tasks. Second, modality measurements are incorporated as weights so that training maximizes object detection while minimizing the drawbacks of each modality. Third, a cross-modality attentive feature fusion (CMAFF) module extracts complementary features to improve the learning of each of the three modal features independently, and a two-stream backbone network based on the attention mechanism improves the learning of challenging samples. Lastly, the horizontal detection head is upgraded to a rotated one to preserve object orientation and optimize detection results. The authors tested CRSIOD on the public UAV aerial image dataset DroneVehicle, where it achieved state-of-the-art detection performance compared with existing approaches.

Vaiyapuri et al. [20] proposed IWDADL-VDC, an approach based on the Intelligent Water Drop Algorithm (IWDA) and intended for remote sensing applications. The method uses a hyperparameter-tuned DL model for vehicle detection and classification. Its two main steps are vehicle detection, performed by an enhanced YOLO-v7 model, and classification, performed by a Deep Long Short-Term Memory (DLSTM) technique. IWDA-based hyperparameter tuning is used to improve the classification results of the DLSTM model. Experimental validation on a benchmark dataset showed promising results for IWDADL-VDC compared with other recent methods.

To address vehicle detection from UAVs, Sun et al. [21] proposed a new dataset called EVD4UAV, consisting of 6,284 photos with 90,886 fine-grained annotated cars. The dataset is altitude-sensitive and includes several altitudes (50, 70, and 90 meters), vehicle attributes (color, type), and bounding boxes, along with views of visible vehicle roofs. Three classic deep neural network-based object detectors were attacked on EVD4UAV using white-box and attack techniques. The experiments showed that these typical attack strategies were unable to carry out reliable, altitude-insensitive attacks.

Shao et al. [22] proposed Aero-YOLO, a lightweight recognition technique based on YOLOv8. The method aims to decrease model parameters, increase computational efficiency, and expand the receptive field by replacing the C2f module with C3 and the original. In addition, the CoordAtt and shuffle attention mechanisms improve feature extraction, which is beneficial for identifying tiny vehicles from a UAV perspective. Finally, three new parameter settings are proposed to meet the demands of various application contexts. Experiments on the VisDrone2019 and UAV-ROD datasets showed that the algorithm improves the speed and accuracy of identifying vehicles and pedestrians and performs well across a variety of heights, angles, and imaging situations.

Using the VisDrone-DET dataset, Muzammul et al. [23] presented a technique for aerial image analysis that fuses the Slicing Aided Hyper Inference (SAHI) methodology with the Real-Time Detection Transformer (RT-DETR). The research uses RT-DETR-X's real-time, end-to-end object identification capabilities to optimize drone technology for applications such as military operations, geological investigation, and water conservation. RT-DETR-X achieves 54.8% Average Precision (AP) at 74 frames per second (FPS), outperforming comparable models in both speed and accuracy. The VisDrone-DET dataset, which contains a wide variety of tiny targets in scenes captured by UAV aerial photography and spans ten categories, offers a strong foundation for thorough model testing. The original image dataset is used for training and assessment, and the SAHI approach is applied to improve small-scale object recognition. By examining the model's performance in several situations and describing the environmental setup in detail, the study emphasizes the benefits of merging RT-DETR with SAHI. According to the authors, the integration increases detection accuracy, provides a foundation for efficient aerial monitoring, and creates new opportunities for sophisticated image analysis in UAV applications.

Lu et al. [24] presented a technique for precisely estimating the position of road users in aerial images. First, a deep learning-based technique with oriented bounding boxes was used to identify road users in aerial photos. An error compensation scheme was then created to correct each road user's position and achieve greater localization accuracy. The scheme was based on an analysis and modeling of the localization error caused by depth relief distortion. The effectiveness of the approach was assessed in field tests. The findings showed promising accuracy in locating road users, so the approach may help increase the credibility of UAVs in traffic applications.


3. PROPOSED WORK 


This section briefly explains the proposed model. The research uses three datasets, whose images are the input to vehicle detection; Figure 1 shows the overall workflow.


[Figure 1 flowchart stages: Input images from three datasets; Label encoding during pre-processing; Encoder-decoder for semantic segmentation; Small object detection; Extract the vehicle target by AVPN; Classification using VCLDA; Fine-tuning using ARFOA]

Figure 1: Workflow Of The Proposed Model


3.1. Datasets Description 

The study considered three aerial imaging datasets in its experiments: VAID, VEDAI, and DLR3K. The specifics of these datasets are given below.


3.1.1 VAID Dataset 

H.Y. Lin et al. introduced the VAID dataset [25] in 2020 for intelligent traffic monitoring through vehicle recognition and classification. The collection contains six thousand vehicle images organized into seven categories, among them minibus, cement truck, and trailer. The footage was shot by a drone under a variety of lighting conditions. To keep the vehicle images uniform, the drone was flown at an altitude of 90-95 meters. The images were captured at a frame rate of 23.98 frames per second with a resolution of 2720 x 1530 and were scaled to 1137 x 640 during pre-processing. Data were collected at ten locations in southern Taiwan. The images show a variety of settings, including a suburban area, a university campus, and a cityscape. Example images from the dataset are displayed in Figure 2.
Figure 2: Sample Images Of VAID Dataset 


3.1.2 VEDAI Dataset 

The VEDAI dataset was first proposed in 2015 [26] for locating vehicles in aerial photos. The small vehicles in the collection exhibit varied properties, such as changing orientations, illumination, shadows, and occlusion. The dataset comes with a consistent evaluation protocol so that other researchers can replicate and compare results, and its authors also provide the results of a few baseline methods. Figure 3 shows a selection of photos taken from the VEDAI dataset.




Figure 3: Sample Images Of VEDAI Dataset 


3.1.3 DLR-3K Dataset 

The DLR-3K dataset [27], also called the DLR Munich vehicle detection dataset, contains 20 high-resolution aerial photos of vehicles in both urban and suburban settings. "Car" and "truck" are two of the vehicle categories included. Images with the "car" class outnumber those with any other vehicle type. To prepare the data for the model, each original photo is split into nine equal tiles, yielding a total of 180 images. Sample photos from the DLR3K collection are shown in Figure 4.

Figure 4: Sample Image 


3.2. Data preprocessing 

The quality of dataset preprocessing directly affects the accuracy of semantic segmentation, and preparing the dataset before training takes a substantial amount of time and effort. After preprocessing, each image in the dataset is paired with a corresponding label image in which each target category is represented by a distinct color.


3.2.1. Label processing and encoding

One goal of label encoding is to establish a direct relationship between labels and colors. The RGB values corresponding to all categories are stored in a csv file to create a color map. Each color is then converted to a single integer by treating its three channels as the digits of a base-256 number, and this integer is used to hash each pixel of the label image to its category, as shown in formulas 1 and 2.

$$ k = (cm[0] \times 256 + cm[1]) \times 256 + cm[2] \tag{1} $$
$$ cm2lbl[k] = i \tag{2} $$

Here cm[0], cm[1], and cm[2] are the RGB values of a pixel, k is the converted integer, and cm2lbl is the hash table built with this hash function; k is used as the index into cm2lbl, which returns the category i that corresponds to the pixel.
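
As an illustration, formulas 1 and 2 can be sketched in a few lines of Python. The color map values and the helper names (`rgb_to_key`, `build_cm2lbl`, `encode_label_image`) are assumptions made for this example; the paper states only that the color map is read from a csv file.

```python
# Hypothetical color map: category index -> RGB triple (illustrative values only).
COLORMAP = [
    (0, 0, 0),      # 0: background
    (255, 0, 0),    # 1: vehicle
    (0, 0, 255),    # 2: building
]

def rgb_to_key(cm):
    """Formula (1): pack an RGB triple into one base-256 integer k."""
    return (cm[0] * 256 + cm[1]) * 256 + cm[2]

def build_cm2lbl(colormap):
    """Formula (2): hash table mapping packed color k to category index i."""
    return {rgb_to_key(rgb): i for i, rgb in enumerate(colormap)}

def encode_label_image(label_rgb, cm2lbl):
    """Convert a 2-D grid of RGB pixels into a grid of category indices."""
    return [[cm2lbl[rgb_to_key(px)] for px in row] for row in label_rgb]

cm2lbl = build_cm2lbl(COLORMAP)
label_rgb = [
    [(0, 0, 0), (255, 0, 0)],
    [(0, 0, 255), (255, 0, 0)],
]
encoded = encode_label_image(label_rgb, cm2lbl)
print(encoded)
```

Because every RGB triple maps to a unique integer below 256^3, the lookup is collision-free.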


3.3. Encoder-Decoder Network for Semantic 
Segmentation 

Deep network topologies for semantic segmentation are extensively discussed in the computer vision field. The proposed model uses a symmetrical encoder-decoder design [28]. The encoder relies on the convolutional layers of VGG-16, a model created for the ILSVRC competition and trained on the ImageNet dataset. Both the encoder and the decoder are made up of convolutional blocks. Each block is followed by a max pooling layer in the encoder or by an unpooling layer in the decoder. Max pooling reduces dimensionality and induces translation invariance. Unpooling, the dual operation of max pooling, takes the place of pooling in the decoder. It relocates each activation to the position of the maximum value ("argmax") computed during the corresponding pooling step; these positions are passed straight to the decoder through a skip connection. The result of such an upsampling is a sparse activation map, which the consecutive decoder convolutions densify. In this way the network upsamples the feature activations back to the original input size, so the output feature maps retain the dimensions of the input and the network determines the value of each pixel independently. Because unpooling accurately relocates abstract features to their low-level saliency locations, the proposed model outperforms deconvolutional alternatives like DeconvNet on smaller objects.
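
The pooling and unpooling pair can be illustrated with a minimal pure-Python sketch. It assumes a single-channel feature map with even height and width and a 2 x 2 window with stride 2; the function names are hypothetical, and a real network would use a deep learning framework instead.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling (stride 2) that also records where each maximum was."""
    pooled, argmax = [], []
    for r in range(0, len(fmap), 2):
        prow, arow = [], []
        for c in range(0, len(fmap[0]), 2):
            window = [(fmap[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            value, pos = max(window, key=lambda t: t[0])
            prow.append(value)
            arow.append(pos)
        pooled.append(prow)
        argmax.append(arow)
    return pooled, argmax

def unpool_2x2(pooled, argmax, height, width):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    out = [[0.0] * width for _ in range(height)]
    for prow, arow in zip(pooled, argmax):
        for value, (r, c) in zip(prow, arow):
            out[r][c] = value
    return out

fmap = [
    [1, 2, 0, 3],
    [3, 4, 1, 0],
    [0, 0, 5, 6],
    [1, 0, 7, 8],
]
pooled, argmax = max_pool_2x2(fmap)
restored = unpool_2x2(pooled, argmax, 4, 4)
print(pooled)
print(restored)
```

The restored map is sparse: only the four recorded maxima are non-zero, which is why the decoder convolutions are needed to densify it.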

The segmentation training set was constructed by sliding a 128 x 128 px window over each image with a 32 px stride, i.e., a 75% overlap. This overlap also augments the data. We employ every class from the ground truth in this experiment: each pixel is labeled with a class (such as "building" or "vehicle"), and the model is trained to predict the vehicle mask. During testing, we tile the images with a 128 x 128 px sliding window and a 50% overlap (i.e., a 64 px stride). To prevent a "mosaic" effect, we average overlapping predictions, which smooths them at the window boundaries. During training, the encoder is initialized with the weights of a VGG-16 pre-trained on ImageNet. The encoder's learning rate is set to half of the decoder's. The network is trained with stochastic gradient descent (SGD).
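
The tiling and averaging scheme can be sketched as follows. This is a simplified illustration: the helper names are hypothetical, the extra border tile is an assumption (the paper does not say how image borders are handled), and scores are plain nested lists rather than tensors.

```python
def tile_origins(size, window=128, stride=32):
    """Top-left offsets of windows along one axis (assumes size >= window)."""
    origins = list(range(0, size - window + 1, stride))
    if origins[-1] != size - window:
        # Assumption: add one extra tile so the image border is covered.
        origins.append(size - window)
    return origins

def average_overlaps(height, width, tiles):
    """Average per-pixel scores from overlapping tiles.

    tiles: iterable of (top, left, scores), where scores is a 2-D list.
    """
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for top, left, scores in tiles:
        for r, row in enumerate(scores):
            for c, s in enumerate(row):
                total[top + r][left + c] += s
                count[top + r][left + c] += 1
    return [[t / n if n else 0.0 for t, n in zip(trow, crow)]
            for trow, crow in zip(total, count)]

# Training uses a 32 px stride (75% overlap); testing uses 64 px (50% overlap).
train_origins = tile_origins(256, window=128, stride=32)
test_origins = tile_origins(256, window=128, stride=64)

# Tiny 1 x 3 example with two overlapping 1 x 2 tiles.
averaged = average_overlaps(1, 3, [(0, 0, [[1.0, 1.0]]), (0, 1, [[0.0, 0.0]])])
print(train_origins, test_origins, averaged)
```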


3.4. Small Object Detection 

We assume that the study uses very-high-resolution (VHR) aerial photos in which a human observer can tell individual cars apart. Under this assumption, the proposed network's semantic maps should be precise enough to prevent nearby automobiles from being merged into one blob. If the assumption holds, finding vehicle instances in the pixel-level mask is as simple as extracting connected components, and the mask of each component can then be used to regress the vehicle's bounding box. Nevertheless, the network's predictions may contain noise due to the CNN's fuzzy class transitions. To reduce this noise, we first apply a morphological opening with a small radius to the vehicle mask. Second, to avoid erroneous vehicle classifications or false positives caused by segmentation artifacts, such as roof vents and street litter, we remove objects smaller than a certain threshold. Despite its simplicity, this morphological opening combined with connected-component extraction is sufficient for efficient vehicle identification.
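
A minimal sketch of the connected-component step is given below. It assumes the mask has already been cleaned by the morphological opening, uses 4-connectivity, and takes a hypothetical `min_area` threshold; none of these choices are specified in the text.

```python
from collections import deque

def extract_vehicles(mask, min_area=3):
    """Bounding boxes (top, left, bottom, right) of 4-connected blobs.

    Blobs with fewer than min_area pixels are dropped as artifacts.
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r0 in range(h):
        for c0 in range(w):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first flood fill collects one connected component.
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            pixels = []
            while queue:
                r, c = queue.popleft()
                pixels.append((r, c))
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if 0 <= nr < h and 0 <= nc < w and mask[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(pixels) >= min_area:
                rows = [p[0] for p in pixels]
                cols = [p[1] for p in pixels]
                boxes.append((min(rows), min(cols), max(rows), max(cols)))
    return boxes

mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],   # the isolated pixel at (1, 4) is below min_area
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1],
]
boxes = extract_vehicles(mask, min_area=3)
print(boxes)
```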


3.5. Accurate Vehicle Proposal Network (AVPN) 

To generate all the vehicle-like regions reliably and in little time, we use an AVPN that takes an image as input and outputs a set of vehicle-like regions, each with a score. Our AVPN is based on an RPN [29] and was inspired by fully convolutional networks and their improved variants. To improve the AVPN's feature map for vehicle recognition, we combine layers with varying resolutions. The design of the AVPN and its training are explained below.


3.5.1 Overall Architecture: The base AVPN architecture has five convolutional layers and three fully connected layers. To create a concatenated feature map, we combine the output feature maps of the final layers. To compute proposals, we replace the fully connected layers with two additional convolutional layers. The training images (of any size) are fed into the first convolutional layer (conv_1), which filters the input with 96 kernels of size 7 x 7 x 3. Its output is fed into the second layer (conv_2), which filters it with 256 kernels of size 5 x 5 x 96. Rectified linear units and max pooling are applied only after these first layers. The layers conv_3, conv_4, and conv_5, which have 384 kernels of size 3 x 3 x 256, are connected to each other without pooling or normalization layers. To merge multilevel feature maps, we stacked a 3 x 3 convolutional layer on top of conv_3 (conv_4), namely conv_inter3 (conv_inter4), which generates extra feature maps with 256 channels. After applying local response normalization to the outputs of conv_inter3, conv_inter4, and conv_5, we concatenated them into a single feature map cube, the hyper feature map. Our findings indicate that the concatenated map is well suited to small-size vehicle identification because its layers are complementary: deeper layers are better suited for classification, and shallower layers are better suited for localization.

We slide a 3 x 3 window over the hyper feature map to generate vehicle-like regions. The sliding operation is easy to implement as a 3 x 3 convolutional layer, conv_slid. With 256 feature maps in total, this yields a 256-d feature vector for every sliding-window location. The vector is then fed into two sibling 1 x 1 layers.

At each sliding-window location we simultaneously predict several regions associated with different aspect ratios and scales. Because the typical size of a vehicle is roughly 35 x 35 pixels, we use three aspect ratios (3:2, 1:1, and 2:3) and three scales with box areas of 30^2, 40^2, and 50^2 pixels. Every location can therefore predict nine kinds of regions. Some predicted regions are removed for AVPN training, and each remaining region is given a binary class label (vehicle or background). We give a predicted region a positive label if it has the maximum IoU with a ground-truth box. Conversely, we give it a negative label if its IoU ratio is below a threshold for every ground-truth box. The remaining regions are discarded. The IoU ratio is defined as follows.


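$$ IoU = \frac{\mathrm{area}(B_{rp} \cap B_{gt})}{\mathrm{area}(B_{rp} \cup B_{gt})} \tag{3} $$

where $\mathrm{area}(B_{rp} \cap B_{gt})$ represents the intersection of the predicted region box and the vehicle ground-truth box, and $\mathrm{area}(B_{rp} \cup B_{gt})$ signifies their union.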
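
The nine region shapes per location and the IoU ratio can be sketched as below. Boxes are written as (x1, y1, x2, y2); reading each aspect ratio as width:height and centring the boxes on the sliding-window location are assumptions of this example, as are the function names.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, iw) * max(0.0, ih)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def make_anchors(cx, cy, sides=(30, 40, 50), ratios=((3, 2), (1, 1), (2, 3))):
    """Nine boxes centred at (cx, cy): area side**2, width:height = rw:rh."""
    anchors = []
    for side in sides:
        area = float(side * side)
        for rw, rh in ratios:
            w = (area * rw / rh) ** 0.5
            h = area / w
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(100.0, 100.0)
ground_truth = (85.0, 85.0, 120.0, 120.0)   # illustrative 35 x 35 vehicle box
overlaps = [iou(a, ground_truth) for a in anchors]
best = max(range(len(anchors)), key=lambda i: overlaps[i])
print(len(anchors), best, round(overlaps[best], 3))
```

The anchor with the highest overlap would receive the positive label under the labelling rule described above.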
3.5.2 Loss Function: With the aforementioned definitions, we use a multitask loss function for the two-sibling-output AVPN. For every predicted region, the first sibling layer produces a vehicle-like score $p_c$, computed by a softmax classifier. The second sibling layer outputs the coordinate vector $loc = (x, y, w, h)$ of each predicted region after bounding-box regression, where x and y are the top-left coordinates of the predicted region and w and h are its width and height. In accordance with [29], we use a smooth $L_1$ loss layer to refine the coordinates. For each positively labelled region and its target ground-truth bounding box $loc^*$, we adopt a multitask loss $L_{AVPN}$ that trains classification and bounding-box regression jointly:

$$ L_{AVPN}(p_c, loc) = L_{cls}(p_c, p^*) + \alpha p^* L_{bbr}(loc, loc^*) \tag{4} $$

where $L_{cls}$ is the classification loss over the vehicle and background classes and $p^*$ is the ground-truth label: $p^* = 1$ if the region box is positive and $p^* = 0$ otherwise, so background regions do not affect bounding-box regression training. The balancing parameter is $\alpha$. We process a batch of training data in each iteration, with each iteration having roughly the same number of region boxes, and we set $\alpha = 2$ to weight the $L_{cls}$ and $L_{bbr}$ terms equally. Moreover, $L_{bbr}$ is a smooth $L_1$ loss defined as

$$ L_{bbr}(loc, loc^*) = f_{L_1}(loc - loc^*), \qquad f_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \tag{5} $$
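
A small numeric sketch of the multitask loss follows. Two points are assumptions of this example rather than statements from the paper: the classification term is taken to be the usual log loss on the softmax score, and the smooth L1 function is summed over the four box coordinates.

```python
import math

def smooth_l1(x):
    """Smooth L1 function: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

def bbr_loss(loc, loc_star):
    """Regression term: smooth L1 summed over (x, y, w, h) (assumed)."""
    return sum(smooth_l1(a - b) for a, b in zip(loc, loc_star))

def cls_loss(p_c, p_star):
    """Log loss (assumed); p_c is the predicted probability of 'vehicle'."""
    p = p_c if p_star == 1 else 1.0 - p_c
    return -math.log(max(p, 1e-12))

def avpn_loss(p_c, loc, p_star, loc_star, alpha=2.0):
    """Classification term plus alpha-weighted regression term.

    The regression term is switched off for background regions (p_star = 0).
    """
    return cls_loss(p_c, p_star) + alpha * p_star * bbr_loss(loc, loc_star)

positive = avpn_loss(0.9, (10, 10, 30, 30), 1, (10, 10, 30, 32))
background = avpn_loss(0.2, (0, 0, 0, 0), 0, (5, 5, 5, 5))
print(round(positive, 4), round(background, 4))
```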
3.5.3 Training AVPN: The AVPN is trained with stochastic gradient descent. To keep the initial AVPN from overfitting, we initialise it from the classification model. The weights of the additional new convolutional layers are initialised randomly from a zero-mean Gaussian distribution with a standard deviation of 0.01. In each iteration, we feed the network a new batch of labelled training data and then update the parameters. Once the AVPN has been trained, we apply it to an input aerial image to generate around 300 candidate region boxes that overlap heavily. To reduce redundancy, non-maximum suppression (NMS) is applied to the proposed regions based on their vehicle confidence scores. The VCLDA then determines which of the vehicle-like regions are relevant and in which directions they point.
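
For reference, greedy non-maximum suppression can be sketched as follows. The IoU threshold of 0.5 is a hypothetical value; the paper does not state the threshold it uses.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, iw) * max(0.0, ih)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it.

    Returns the indices of the kept boxes, ordered by decreasing score.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
keep = nms(boxes, scores)
print(keep)
```

In this example the second box overlaps the first with an IoU of about 0.68, so it is suppressed and only the first and third boxes remain.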


3.6. Vehicle Classification Via Linear Discriminant Analysis (VCLDA)

Linear discriminant analysis is one variation on the Bayesian concept [30]. Because it is a supervised technique, it requires class labels for training. LDA aims to maintain high inter-class variation and minimal intra-class variation. Here it is used to categorize the detected cars into different groups. Since LDA determines its coefficients from the differences among the classes, scaling is not necessary. After separating each class, the following equation is used to combine the nine classes:

$$ \Sigma = \sum_{i=1}^{C} (\mu_i - \mu)(\mu_i - \mu)^{T} \tag{6} $$

where C is the number of classes, $\Sigma$ characterizes the covariance, $\mu_i$ is the mean of class i, and $\mu$ is the overall mean.
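
The between-class term that LDA maximizes can be computed as in the sketch below. It assumes the unweighted form, a sum over classes of the outer product of (class mean minus overall mean) with itself, and uses made-up two-dimensional features for two classes; the function names are hypothetical.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def between_class_scatter(samples_by_class):
    """Sum over classes of (mu_i - mu)(mu_i - mu)^T, without class weights."""
    all_samples = [x for xs in samples_by_class.values() for x in xs]
    mu = mean(all_samples)
    dim = len(mu)
    scatter = [[0.0] * dim for _ in range(dim)]
    for xs in samples_by_class.values():
        mu_i = mean(xs)
        diff = [a - b for a, b in zip(mu_i, mu)]
        for r in range(dim):
            for c in range(dim):
                scatter[r][c] += diff[r] * diff[c]
    return scatter

# Illustrative features only; real inputs would be region descriptors.
samples = {
    "car": [(0.0, 0.0), (2.0, 0.0)],
    "truck": [(4.0, 2.0), (6.0, 2.0)],
}
scatter = between_class_scatter(samples)
print(scatter)
```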


3.6.1. Fine-tuning of VCLDA using Artificial 
Root Foraging Optimization 

To improve classification accuracy by optimizing the LDA parameters, the research employs the ARFOA model, which is described below.


1) Classical Plant Root Growth Model

The artificial root algorithm was designed around the way plant roots optimize their growth. In a biological plant, the main root grows down into the ground, and the lateral roots extend from the main root. Lateral roots may in turn produce further lateral roots that grow in different directions. The lateral roots can form in any direction with variable degrees of movement, whereas the main roots cannot. The classical model of plant root growth is therefore also used to construct the artificial method. The characteristics of the soil are assumed to constrain root growth and to determine where the best solutions lie. For the VCLDA problem, the direction changes and length adjustments are treated as the fine-tuning limits [31]. The following factors are taken into account for optimal plant growth, and the artificial model does the same.

Factor 1: The concentration of auxin in the plant has a significant impact on the spatial arrangement of the roots; it allows the root system to be restructured regularly in response to the problem.

Factor 2: An apex can produce child root apices that grow in the same direction.

Factor 3: The root system produces a varying number of branches in response to auxin availability.

Factor 4: Hydrotropism determines the directions in which the main root's tip and the lateral roots move along their trajectories.


2) Auxin Regulation 

When creating new branches and performing the movement processes, the auxin concentration is the main parameter to consider. Soil nutrient availability is thus defined in the following way:

$$ f_i = \frac{fitness_i - f_{low}}{f_{high} - f_{low}} \tag{7} $$

The auxin concentration is then written as

$$ A_i = \frac{f_i}{\sum_{j=1}^{s} f_j} \tag{8} $$

where $fitness_i$ is the fitness of root i, $f_i$ is its normalized fitness, $f_{high}$ and $f_{low}$ are the highest and lowest fitness values in the existing root population, and s is the population size.
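
A sketch of the auxin computation under stated assumptions: fitness is min-max normalized over the current population, the normalized values are divided by their sum, and higher fitness is treated as better. The fallback for a population with identical fitness values is an addition of this example, not part of the described algorithm.

```python
def auxin_concentrations(fitness):
    """Normalize fitness to [0, 1], then scale so the values sum to 1."""
    f_low, f_high = min(fitness), max(fitness)
    span = f_high - f_low
    if span == 0:
        # Assumption: identical fitness values share the auxin equally.
        return [1.0 / len(fitness)] * len(fitness)
    normalized = [(v - f_low) / span for v in fitness]
    total = sum(normalized)
    return [v / total for v in normalized]

auxin = auxin_concentrations([2.0, 4.0, 6.0])
print(auxin)
```

Roots with better fitness receive a larger share of auxin, and the worst root receives none.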


3) Strategy on Main Root Growth 

The main root grows without branching or re-growing. Its movement is determined by the best individual relative to its present location. In mathematical notation, it is expressed as

$$ x_i^{t} = x_i^{t-1} + l \cdot \varepsilon \cdot (x_{best} - x_i^{t-1}) \tag{9} $$

Here $x_i^{t}$ is the new position and $x_i^{t-1}$ is the position where root i is currently located; l represents the learning inertia, $\varepsilon$ is a random coefficient uniformly distributed between 0 and 1, and $x_{best}$ is the best individual found so far.


4) Branching Operator 
The operator uses the auxin concentration at a root 
apex to decide whether a new individual is 
produced: a root branches only when its auxin 
concentration exceeds the branching threshold. 
A branch's potential offspring count is determined 
by 

$$\begin{cases} \text{branch } w_i \text{ individuals}, & \text{if } A_i \text{ exceeds the branching threshold} \\ \text{stop branching}, & \text{otherwise} \end{cases} \qquad (10)$$

The number of newly produced apices is estimated 
from the following equation 

$$w_i = \varepsilon A_i (B_{max} - B_{min}) + B_{min} \qquad (11)$$

where $\varepsilon$ is a random number between 0 and 1, $A_i$ 
is the auxin concentration level at the root, and 
$B_{max}$ and $B_{min}$ describe the maximum and 
minimum branch counts. The position of a new 
branch root is drawn from the distribution 
$N(x_i^{t}, \sigma_i^2)$, whose standard deviation is 
written as 

$$\sigma_i = \frac{i_{max} - i}{i_{max}} \times (\sigma_{ini} - \sigma_{fin}) + \sigma_{fin} \qquad (12)$$

where $i_{max}$ is the maximum number of iterations, 
$i$ is the current iteration index, $\sigma_{ini}$ is the initial 
standard deviation, and $\sigma_{fin}$ is the final standard 
deviation. 
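
The branching step above can be sketched in Python as three small helpers. The function names, the rounding of the branch count to an integer, and the per-coordinate Gaussian sampling are assumptions for illustration; the source gives no numeric value for the branching threshold, so it is passed in as a parameter.

```python
import random


def branch_count(auxin, threshold, b_min, b_max, rng=random):
    """Number of new apices for one root; 0 means 'stop branching'."""
    if auxin <= threshold:
        return 0
    eps = rng.random()  # random number in [0, 1)
    # Assumption: the real-valued count is rounded to the nearest integer.
    return int(round(eps * auxin * (b_max - b_min) + b_min))


def branch_sigma(iteration, max_iteration, sigma_ini, sigma_fin):
    """Standard deviation shrinking linearly from sigma_ini to sigma_fin."""
    frac = (max_iteration - iteration) / max_iteration
    return frac * (sigma_ini - sigma_fin) + sigma_fin


def spawn_branches(position, count, sigma, rng=random):
    """Sample new branch positions around `position`, one Gaussian per coordinate."""
    return [[rng.gauss(x, sigma) for x in position] for _ in range(count)]


rng = random.Random(1)
n_new = branch_count(auxin=0.8, threshold=0.5, b_min=1, b_max=4, rng=rng)
sigma = branch_sigma(iteration=50, max_iteration=100, sigma_ini=1.0, sigma_fin=0.1)
children = spawn_branches([0.0, 0.0], n_new, sigma, rng=rng)
```

Because the standard deviation decreases with the iteration index, new branches are scattered widely early on and stay close to their parent root near the end of the run.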
5) Lateral or Branch Root Development 
Under any nutrient condition, the lateral roots are 
free to explore at random. As a result, the 
mathematical expression for the length and 
direction of lateral root growth is 

$$x_i^{t} = x_i^{t-1} + \varepsilon \, l_{max} \, D_i(\varphi) \qquad (13)$$

$$D_i(\varphi) = \frac{\delta_i}{\sqrt{\delta_i^{T} \delta_i}} \qquad (14)$$

where $l_{max}$ stands for the maximum length of the 
lateral root, $D_i$ is the growth direction of root $i$, 
and $\varphi$ stands for the angle, expressed through a 
random vector $\delta_i$. 
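
A minimal Python sketch of the lateral root step follows. It assumes the random vector is drawn uniformly from [-1, 1] in each coordinate and then normalized to unit length; that sampling choice and the function names are illustrative assumptions rather than details given in the text.

```python
import math
import random


def random_direction(dim, rng=random):
    """Unit direction obtained by normalizing a random vector delta."""
    while True:
        # Assumption: delta is sampled uniformly from [-1, 1] per coordinate.
        delta = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in delta))
        if norm > 0.0:
            return [d / norm for d in delta]


def grow_lateral_root(position, l_max, rng=random):
    """Random-walk step of at most l_max in a random direction."""
    eps = rng.random()  # random elongation fraction in [0, 1)
    direction = random_direction(len(position), rng)
    return [x + eps * l_max * d for x, d in zip(position, direction)]


rng = random.Random(2)
start = [1.0, 1.0, 1.0]
moved = grow_lateral_root(start, l_max=0.5, rng=rng)
```

The step length never exceeds the maximum lateral root length, because the direction has unit norm and the random fraction is below one.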


4. RESULTS AND ANALYSIS 


The model was trained on a GeForce RTX 3080 Ti 
GPU. The computational performance and 
correctness of the model are validated by a 
parameter sensitivity analysis, which is also used 
to establish the input layer size and other 
parameters. The training loss is computed as the 
sum of the squared errors at the last network layer. 
Table 1 lists the parameters that were used for 
training. 
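
The sum-of-squared-errors loss mentioned above can be written as a short Python sketch. The function name and the flat list format for predictions and targets are assumptions; the actual network outputs are multi-dimensional tensors.

```python
def sum_squared_error(predictions, targets):
    """Sum of squared differences between network outputs and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets))


# Hypothetical output values, chosen only to illustrate the arithmetic
loss = sum_squared_error([0.9, 0.2, 0.4], [1.0, 0.0, 0.5])
```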
Table 1. Parameters used during the training of the model. 

Value/Range    Parameter Name 
4              Mini-batch size 
0.001          Learning rate (initial) 
608 x 608      Input layer size 
0.0005         Weight decay 
0.9            Momentum 

We used an 80:20 split between the training 
and test sets for the VEDAI and VAID datasets. 
In contrast, the DLR-3K dataset was split 70:30 
between the training and test sets. The proposed 
model is trained on all datasets without using 
pretrained weights. During training, the model 
ran for 20,000 iterations on the VEDAI, VAID, 
and DLR-3K datasets. After every 5K iterations, 
the learning rate is adjusted by a factor of 100. 
A set of candidate bounding boxes is generated 
for every object. Based on the stated threshold, 
the model selects, for each object, the bounding 
box with the highest IoU score. 

Figure 5: Accuracy on Training and Testing Data (plot of model accuracy versus epoch) 
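
The box selection step described above can be sketched in Python as follows. The corner-coordinate box format, the function names, and the default threshold of 0.5 are assumptions made for this example; the text does not state the threshold value.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_box(candidates, reference, threshold=0.5):
    """Return (box, score) with the highest IoU, or None below the threshold.

    Assumption: threshold=0.5 is a placeholder, not a value from the paper.
    """
    if not candidates:
        return None
    score, box = max(((iou(c, reference), c) for c in candidates),
                     key=lambda pair: pair[0])
    return (box, score) if score >= threshold else None


candidates = [(0, 0, 2, 2), (1, 1, 3, 3)]
selected = best_box(candidates, reference=(1, 1, 3, 3))
```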




4.1. Accuracy and Loss of the Proposed Model 
Figures 5 and 6 show the accuracy and 
loss of the proposed classifier model. 

Figure 6: Loss on Training and Testing Data (plot of model loss versus epoch) 


4.2. Validation Analysis of the Proposed 
Classifier 
Table 2 presents the validation analysis of 
the proposed VCLDA-ARFOA model at different 
learning rates. 

Table 2: Experimental Analysis of Different Learning Rates on Three Datasets 

Dataset   Learning rate   Sensitivity   Specificity   Accuracy   F1-Score   AUC-ROC 
VAID      0.1             0.83          0.61          0.72       0.73       0.86 
VAID      0.01            0.83          0.77          0.79       0.60       0.88 
VAID      0.001           0.84          0.84          0.84       0.83       0.88 
VEDAI     0.1             0.84          0.39          0.62       0.68       0.78 
VEDAI     0.01            0.80          0.53          0.63       0.56       0.79 
VEDAI     0.001           0.74          0.71          0.71       0.73       0.78 
DLR-3K    0.1             0.73          0.49          0.61       0.61       0.70 
DLR-3K    0.01            0.69          0.58          0.63       0.54       0.73 
DLR-3K    0.001           0.69          0.58          0.64       0.62       0.74 
Table 2 above characterizes the performance 
of the proposed model at different learning rates 
on the three datasets. For the VAID dataset, with a 
learning rate of 0.1, the proposed model achieved a 
sensitivity of 0.83, specificity of 0.61, accuracy of 
0.72, F1-score of 0.73, and AUC-ROC of 0.86. 
With a learning rate of 0.01, the model attained a 
sensitivity of 0.83, specificity of 0.77, accuracy of 
0.79, F1-score of 0.60, and AUC-ROC of 0.88. At 
a learning rate of 0.001, the proposed model 
achieved a sensitivity of 0.84, specificity of 0.84, 
accuracy of 0.84, F1-score of 0.83, and AUC-ROC 
of 0.88. 

Moving to the VEDAI dataset, with a 
learning rate of 0.1, the proposed model attained a 
sensitivity of 0.84, specificity of 0.39, accuracy of 
0.62, F1-score of 0.68, and AUC-ROC of 0.78. At 
a learning rate of 0.01, the model achieved a 
sensitivity of 0.80, specificity of 0.53, accuracy of 
0.63, F1-score of 0.56, and AUC-ROC of 0.79. 
With a learning rate of 0.001, the proposed model 
attained a sensitivity of 0.74, specificity of 0.71, 
accuracy of 0.71, F1-score of 0.73, and AUC-ROC 
of 0.78. 

For the DLR-3K dataset, at a learning rate 
of 0.1, the proposed model achieved a sensitivity of 
0.73, specificity of 0.49, accuracy of 0.61, F1-score 
of 0.61, and AUC-ROC of 0.70. With a learning 
rate of 0.01, the model attained a sensitivity of 0.69, 
specificity of 0.58, accuracy of 0.63, F1-score of 
0.54, and AUC-ROC of 0.73. At a learning rate of 
0.001, the proposed model achieved a sensitivity of 
0.69, specificity of 0.58, accuracy of 0.64, 
F1-score of 0.62, and AUC-ROC of 0.74. 
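
For readers who want to see how the reported metrics relate to one another, the following Python sketch derives sensitivity, specificity, accuracy, and F1-score from confusion-matrix counts. The counts are hypothetical and are not taken from the experiments; AUC-ROC is omitted because it requires per-sample scores rather than counts.

```python
def safe_div(numerator, denominator):
    """Division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts."""
    sensitivity = safe_div(tp, tp + fn)   # recall / true positive rate
    specificity = safe_div(tn, tn + fp)   # true negative rate
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    precision = safe_div(tp, tp + fp)
    f1 = safe_div(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts, chosen only to illustrate the formulas
metrics = classification_metrics(tp=40, fp=20, tn=30, fn=10)
```

With these illustrative counts, a high sensitivity paired with a lower specificity pulls the F1-score below the sensitivity, which is the same pattern seen at the larger learning rates.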
Figure 7: Graphical Description of Proposed Model on 
Different Learning Rates 

Figure 8: Visual Representation of Proposed Model on 
Three Datasets 


4.3. Comparative Analysis of the Proposed Model 

Table 3 presents the comparative study of 
the proposed model with existing procedures. The 
techniques are implemented on the three datasets, 
and the results are averaged over all three 
datasets. 


Table 3: Comparative Analysis of the Proposed 
Model with Existing Procedures 

Model             Accuracy   Precision   Recall   F1-Score 
FRCNN             0.9422     0.9111      0.7332   0.8033 
ACF               0.9314     0.9515      0.6138   0.7043 
HRPN              0.9425     0.9112      0.7335   0.8034 
CPPM [32]         0.9698     0.9698      0.9698   0.9698 
YOLO [17]         0.9696     0.9696      0.9696   0.9696 
LP-NMS [19]       0.9684     0.9684      0.9684   0.9684 
DLSTM [20]        0.9675     0.9675      0.9675   0.9675 
Aero-YOLO [22]    0.9720     0.9720      0.9720   0.9720 
VCLDA-ARFOA       0.9823     0.9823      0.9823   0.9823 


Table 3 above presents the comparison of the 
proposed model with existing procedures. In the 
analysis, the FRCNN technique achieved an 
accuracy of 0.9422, precision of 0.9111, recall of 
0.7332, and F1-score of 0.8033. The ACF 
technique attained an accuracy of 0.9314, 
precision of 0.9515, recall of 0.6138, and F1-score 
of 0.7043. The HRPN technique achieved an 
accuracy of 0.9425, precision of 0.9112, recall of 
0.7335, and F1-score of 0.8034. The CPPM [32], 
YOLO [17], LP-NMS [19], DLSTM [20], and 
Aero-YOLO [22] techniques attained values of 
0.9698, 0.9696, 0.9684, 0.9675, and 0.9720, 
respectively, for each of accuracy, precision, 
recall, and F1-score. Finally, the VCLDA-ARFOA 
technique achieved an accuracy, precision, recall, 
and F1-score of 0.9823. 


Figure 9: Comparison Analysis of Different Classifiers 

Figure 10: Visual Analysis of Various DL Classifiers 
for Vehicle Detection 

4.4. Discussion 

Using high-resolution aerial imagery, the 
proposed traffic monitoring system aims to regulate 
traffic. In this research, we built a system that can 
accurately isolate automobiles in aerial photos by 
using convolutional neural network (CNN)-based 
semantic segmentation. The proposed VCLDA- 
ARFOA further examines these segmented objects 
for vehicle detection. The detected vehicles are 
subsequently classified into various groups. The 
primary requirement, especially for vehicle 
recognition, is high-resolution aerial photos. 
Consequently, to achieve notable outcomes for 
segmented scene photos, an efficient CNN-based 
segmentation method was integrated. Analyzing the 
segmented aerial photos allows for the 
identification of various vehicles. An innovative 
VCLDA-ARFOA technique is developed for the 
detection stage, which is the most crucial 
element of the scheme, to improve overall 
efficiency. When it comes to detection and 
classification, VCLDA-ARFOA excels particularly 
in terms of recall. Additionally, the proposed 
VCLDA-ARFOA method enhances the successful 
labeling of detected vehicles. 

The research work applied various systems, 
such as the region proposal network (HRPN), 
Faster R-CNN, aggregated channel features (ACF), 
CPPM, YOLO [17], LP-NMS [19], DLSTM [20], 
and Aero-YOLO [22], to the VEDAI and DLR-3K 
datasets. From this analysis, it is evident that the 
proposed model achieved better performance 
because the LDA hyperparameters are optimally 
selected by the ARFOA model. The existing 
models, however, do not focus on hyperparameter 
tuning, leading to average performance, as shown 
graphically in Figures 9 and 10. 


4.5 Open Issues and Limitations 

Addressing the following open issues and 
limitations would not only strengthen the research 
findings but also contribute to the development of 
more robust and practical solutions for intelligent 
vehicle detection and classification in aerial 
imagery. 

Scalability and Real-Time Processing: 
While the proposed system shows promise in 
vehicle detection and classification, its scalability to 
real-time processing and its efficiency in handling 
large-scale aerial imagery remain unclear. Real- 
time processing is crucial for applications like 
traffic monitoring and management, where timely 
responses are necessary. 

Robustness to Environmental Variability: 
The research mentions challenges arising from 
dynamic scenes with inaccurate vehicle information 
and cluttered backgrounds. However, further 
investigation into the system's robustness to various 
environmental conditions, such as weather changes 
(e.g., fog, rain, snow) and lighting variations (e.g., 
day vs. night), would be valuable. Ensuring the 
system's performance across different 
environmental conditions is essential for its 
practical deployment. 

Generalization to Different Geographical 
Regions: The datasets used in the experiments 
cover a broad spectrum of backgrounds and vehicle 
types in urban and rural settings worldwide. 
However, the generalization of the proposed system 
to different geographical regions with distinct 
characteristics (e.g., infrastructure, vehicle types, 
traffic regulations) remains to be thoroughly 
explored. Adapting the system to specific regional 
nuances could improve its performance and 
applicability in diverse contexts. 

Evaluation on Additional Benchmarks: 
While the experimental results demonstrate the 
efficacy of the proposed approach on the VAID, 
VEDAI, and DLR-3K datasets, evaluation on 
additional benchmark datasets would provide a 
more comprehensive assessment of its 
performance. Utilizing datasets with different 
characteristics and challenges could offer insights 
into the system's strengths and weaknesses in 
varying scenarios. 

Ethical and Privacy Considerations: As 
with any surveillance system, ethical and privacy 
considerations are paramount. Further discussion 
on how the proposed system addresses or mitigates 
potential concerns related to privacy invasion, data 
security, and unintended consequences (e.g., biases 
in decision-making) would be essential for its 
responsible deployment. 


5. CONCLUSION 

In our research, we present a robust system 
designed to identify vehicles in drone aerial photos, 
addressing crucial areas such as smart surveillance 
systems, intelligent traffic monitoring, and efficient 
traffic management. Leveraging the proposed LDA 
model, our innovative traffic monitoring scheme 
significantly enhances the effectiveness of vehicle 
detection. Initially, our methodology employs an 
encoder-decoder module to efficiently segment 
aerial images before precisely identifying various 
vehicles. These vehicles are then categorized using 
linear discriminant analysis, followed by 
optimization of hyperparameters through the 
ARFOA model. Experimental validation on the 
VAID, VEDAI, and DLR-3K datasets demonstrates 
the efficacy of our approach, surpassing previous 
state-of-the-art methods. 

The datasets used in our trials are 
dynamic, diverse, and complex, encompassing a 
broad spectrum of backgrounds and vehicle types in 
urban and rural settings worldwide. Our proposed 
detection module, VCLDA-ARFOA, exhibits 
varying degrees of success across datasets due to 
the dynamic nature of the scenes, which may 
contain inaccurate vehicle information and cluttered 
backgrounds. Challenges arose in settings where 
objects were partially or completely occluded, for 
example by trees or by the shadows of nearby 
buildings. 

Moving forward, we aim to further 
enhance traffic surveillance using deep learning 
techniques, focusing on improving vehicle 
recognition and tracking. By addressing these 
challenges, we strive to make vehicle tracking more 
accurate and effective in diverse and challenging 
environments. 